Versioning Example (Part 2/3)

In part 1, we trained and logged a tweet sentiment classifier using ModelDB's versioning system.

Now we'll see how that can come in handy when we need to revisit or even revert changes we make.

This workflow requires verta>=0.14.1 and spaCy>=2.0.0.


Setup

As before, import libraries we'll need...


In [1]:
from __future__ import unicode_literals, print_function

import boto3
import json
import numpy as np
import pandas as pd
import spacy

...and instantiate Verta's ModelDB Client.


In [2]:
from verta import Client

client = Client('https://app.verta.ai')
proj = client.set_project('Tweet Classification')
expt = client.set_experiment('SpaCy')


set email from environment
set developer key from environment
connection successfully established
set existing Project: Tweet Classification from personal workspace
set existing Experiment: SpaCy

Prepare Data

This time, things are a little different.

Let's say someone has provided us with a new, experimental dataset that will supposedly improve our model. Unbeknownst to everyone, this dataset actually contains only one of the two classes we're interested in. This is going to hurt our model's performance, but we don't know that yet.

Before, we trained a model on english-tweets.csv. Now, we're going to train with positive-english-tweets.csv.


In [3]:
S3_BUCKET = "verta-starter"
S3_KEY = "positive-english-tweets.csv"
FILENAME = S3_KEY

boto3.client('s3').download_file(S3_BUCKET, S3_KEY, FILENAME)

In [4]:
import utils

data = pd.read_csv(FILENAME).sample(frac=1).reset_index(drop=True)
utils.clean_data(data)

data.head()


Out[4]:
text sentiment
0 Sean Lock is awesome !! ... I love Family Guy ... 1
1 Date night with Jared! At the movies! 1
2 ohac track 2 Tirthankar says you should partic... 1
3 long weekend 1
4 Drawing. Slightly irritated. Oh well nothing I... 1
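Had we inspected the labels at this point, the flaw would have been obvious. A quick sanity check along these lines (a generic pandas sketch with made-up rows, not part of the original workflow) flags a dataset that contains only one class:

```python
import pandas as pd

def check_class_balance(data, label_col="sentiment"):
    """Return per-class counts and warn if only one class is present."""
    counts = data[label_col].value_counts()
    if len(counts) < 2:
        print("WARNING: dataset contains only one class:", counts.index.tolist())
    return counts

# Illustrative rows mirroring positive-english-tweets.csv (every label is 1)
data = pd.DataFrame({
    "text": ["long weekend", "Date night with Jared!", "Drawing. Slightly irritated."],
    "sentiment": [1, 1, 1],
})
check_class_balance(data)
```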

Capture and Version Model Ingredients

As before, we'll capture and log our model ingredients directly onto our repository's master branch.


In [5]:
from verta.code import Notebook
from verta.configuration import Hyperparameters
from verta.dataset import S3
from verta.environment import Python

code_ver = Notebook()  # Notebook & git environment
config_ver = Hyperparameters({'n_iter': 20})
dataset_ver = S3("s3://{}/{}".format(S3_BUCKET, S3_KEY))
env_ver = Python()  # pip environment and Python version



In [6]:
repo = client.set_repository('Tweet Classification')
commit = repo.get_commit(branch='master')


set existing Repository: Tweet Classification from personal workspace

In [7]:
commit.update("notebooks/tweet-analysis", code_ver)
commit.update("config/hyperparams", config_ver)
commit.update("data/tweets", dataset_ver)
commit.update("env/python", env_ver)

commit.save("Update tweet dataset")

commit


Out[7]:
(Branch: master)
Commit 62d20618f919d6ebaa389caea39e3cf27cad6e7cc5b18cc9248935e2432da27d containing:
config/hyperparams (Hyperparameters)
data/tweets (S3)
env/python (Python)
notebooks/tweet-analysis (Notebook)

You can verify through the Web App that this commit updates both the dataset and the Notebook.


Train and Log Model

As before, we'll train the model and log it, along with the commit, to an Experiment Run.


In [8]:
nlp = spacy.load('en_core_web_sm')

In [9]:
import training

training.train(nlp, data, n_iter=20)


Using 16000 examples (12800 training, 3200 evaluation)
Training the model...
LOSS 	  P  	  R  	  F  
0.215	1.000	1.000	1.000
0.001	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
0.000	1.000	1.000	1.000
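Those uniform 1.000 scores are the warning sign: when every evaluation example belongs to the positive class, a model that always predicts "positive" earns perfect precision, recall, and F-score without learning anything. A minimal illustration (plain Python, not part of the spaCy training loop):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute binary precision, recall, and F1 from label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Evaluation set drawn only from the positive class, as in our flawed dataset
y_true = [1] * 3200
y_pred = [1] * 3200  # a degenerate "always positive" classifier
print(precision_recall_f1(y_true, y_pred))  # (1.0, 1.0, 1.0)
```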

In [10]:
run = client.set_experiment_run()

run.log_model(nlp)


created new ExperimentRun: Run 421651584660604710162
upload complete (custom_modules.zip)
upload complete (model.pkl)
upload complete (model_api.json)

In [11]:
run.log_commit(
    commit,
    {
        'notebook': "notebooks/tweet-analysis",
        'hyperparameters': "config/hyperparams",
        'training_data': "data/tweets",
        'python_env': "env/python",
    },
)

Revert Commit

Looking back over our workflow, we might notice that there's something suspicious about the model's precision, recall, and F-score: perfect 1.000 across the board from the very first iteration is almost never legitimate. This model isn't performing as it should, and we don't want it to be the latest commit on master. Using the Client, we'll revert the commit.


In [12]:
commit


Out[12]:
(Branch: master)
Commit 62d20618f919d6ebaa389caea39e3cf27cad6e7cc5b18cc9248935e2432da27d containing:
config/hyperparams (Hyperparameters)
data/tweets (S3)
env/python (Python)
notebooks/tweet-analysis (Notebook)

In [13]:
commit.revert()

commit


Out[13]:
(Branch: master)
Commit 0760381007ec1f0f9a452b7d61c2d385476c6a6727b3aaac028c6abb53417010 containing:
config/hyperparams (Hyperparameters)
data/tweets (S3)
env/python (Python)
notebooks/tweet-analysis (Notebook)

As easy as that: we have a new commit on master that reverts our mistake.

Again, the Web App will show that the change from english-tweets.csv to positive-english-tweets.csv has been undone.